df <- read.csv("TrainData.csv") |>
na.omit() |>
distinct()4 Analysis
In this analysis we aim to show that QR can achieve similar, if not better, performance to OLS across various metrics. In order to make the comparisons fair, we will compare the 50th quantile QR, which corresponds to the median, to OLS regression as both the median and mean are measures of centrality. The power of QR is that it is able to produce similar results to OLS regression without having to meet the strict assumptions of OLS such as the assumption of normality. In fact, in this data set neither the response variable or predictor variables meet the assumptions of OLS, and therefore regardless of the performance of OLS it is invalid.
4.1 Visualization
4.1.1 Visualizing data
There are many different kinds of predictor variables in this data set. For instance, there are continuous variables like GrLivArea, discrete/coutning variables likr YearBuilt, and categorical variables like HouseStyle. In all cases we cases we can see that the data is not normally distributed, including in the response variable, SalePrice. Thus, the assumptions of OLS are not met so it cannot be used to make predictions on the data. However, for the purposes of comparing the performance of OLS to QR. We will show that QR is able to give similar results for this data set to OLS, and because it does not require the same assumptions as OLS, one can actually use QR in practice for this kind of data, which is more common than normally distributed data in many important fields, like finance and epidemiology.
suppressWarnings({
p1 <- df |> ggplot(aes(x = GrLivArea)) +
geom_histogram(binwidth = 100) +
theme_bw() +
ylab(NULL) +
xlab("Above Ground Area (sq. ft.)")
p2 <- df |> ggplot(aes(x = YearBuilt)) +
geom_histogram(binwidth = 5) +
theme_bw() +
ylab(NULL) +
xlab("Year Built")
p3 <- df |> ggplot(aes(x = HouseStyle)) +
geom_histogram(stat="count") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ylab(NULL) +
xlab("House Style")
p4 <- df |> ggplot(aes(x = SalePrice)) +
geom_histogram(binwidth = 10000) +
theme_bw() +
ylab(NULL) +
xlab("Sale Price ($)")
grid.arrange(p1, p2, p3, p4, nrow = 2)
})4.1.2 Visualizing quantile regression vs OLS
df |> ggplot(aes(y = SalePrice, x = LotArea)) +
geom_point(size = 0.9) +
geom_smooth(method = lm, se = F, color = "black") +
geom_text(aes(y = 400000, x = 150000, label = "OLS"), color="black") +
geom_quantile(quantiles=0.5, color="red") +
geom_text(aes(y = 470000, x = 90000, label = "50th quantile"), color="red") +
ylab("Sale price ($)") +
xlab("Lot area (Square feet)") +
theme_bw()# df |> ggplot(aes(y = SalePrice, x = GrLivArea)) +
# geom_boxplot()
df |> ggplot(aes(y = SalePrice, x = GrLivArea)) +
geom_point(size = 0.9) +
stat_smooth(method = lm, color = "black") +
geom_text(aes(x = 4150, y = 500000, label = "OLS"), color="black") +
geom_quantile(quantiles=0.25, color="red") +
geom_text(aes(x = 4000, y = 270000, label = "25th quantile"), color="red") +
geom_quantile(quantiles=0.5, color="blue") +
geom_text(aes(x = 4150, y = 400000, label = "50th"), color="blue") +
geom_quantile(quantiles=0.75, color="green") +
geom_text(aes(x = 4000, y = 600000, label = "75th quantile"), color="green") +
xlab("Sale price ($)") +
ylab("Above ground area (Square feet)") +
theme_bw()4.2 Model creation
4.2.1 QR model
qr50 = rq(data=df, SalePrice ~ GrLivArea + LotArea + TotRmsAbvGrd + as.factor(LotShape) + as.factor(Foundation), tau=0.5)
qr50_summary = summary(qr50)
qr50_summary
Call: rq(formula = SalePrice ~ GrLivArea + LotArea + TotRmsAbvGrd +
as.factor(LotShape) + as.factor(Foundation), tau = 0.5, data = df)
tau: [1] 0.5
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 36326.81296 3853.84854 9.42611 0.00000
GrLivArea 96.66934 4.02708 24.00481 0.00000
LotArea 0.99940 0.32815 3.04561 0.00236
TotRmsAbvGrd -6476.18114 1080.95132 -5.99119 0.00000
as.factor(LotShape)IR2 -5084.13375 7841.20685 -0.64839 0.51684
as.factor(LotShape)IR3 -21074.80675 7616.42154 -2.76702 0.00573
as.factor(LotShape)Reg -11065.07360 2020.92512 -5.47525 0.00000
as.factor(Foundation)CBlock 21252.40678 1709.40460 12.43264 0.00000
as.factor(Foundation)PConc 53311.16094 2618.05941 20.36285 0.00000
as.factor(Foundation)Slab -16867.20619 5378.30454 -3.13616 0.00175
as.factor(Foundation)Stone 14561.54748 13561.64146 1.07373 0.28312
as.factor(Foundation)Wood -2008.81877 9022.14216 -0.22265 0.82384
4.2.2 OLS model
ols = lm(data=df, SalePrice ~ GrLivArea + LotArea + TotRmsAbvGrd + as.factor(LotShape) + as.factor(Foundation))
ols_summary = summary(ols)
ols_summary
Call:
lm(formula = SalePrice ~ GrLivArea + LotArea + TotRmsAbvGrd +
as.factor(LotShape) + as.factor(Foundation), data = df)
Residuals:
Min 1Q Median 3Q Max
-422488 -26194 -805 20461 326538
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.005e+04 7.267e+03 2.759 0.00587 **
GrLivArea 9.893e+01 4.538e+00 21.801 < 2e-16 ***
LotArea 9.173e-01 1.425e-01 6.438 1.64e-10 ***
TotRmsAbvGrd -4.313e+03 1.396e+03 -3.089 0.00205 **
as.factor(LotShape)IR2 -2.009e+03 8.113e+03 -0.248 0.80446
as.factor(LotShape)IR3 -6.936e+04 1.603e+04 -4.328 1.61e-05 ***
as.factor(LotShape)Reg -1.342e+04 2.809e+03 -4.777 1.96e-06 ***
as.factor(Foundation)CBlock 2.094e+04 4.497e+03 4.656 3.52e-06 ***
as.factor(Foundation)PConc 6.679e+04 4.541e+03 14.708 < 2e-16 ***
as.factor(Foundation)Slab -1.426e+04 1.067e+04 -1.336 0.18170
as.factor(Foundation)Stone -3.396e+03 2.021e+04 -0.168 0.86658
as.factor(Foundation)Wood -5.553e+02 2.842e+04 -0.020 0.98441
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 48410 on 1448 degrees of freedom
Multiple R-squared: 0.6315, Adjusted R-squared: 0.6287
F-statistic: 225.6 on 11 and 1448 DF, p-value: < 2.2e-16
4.3 Model evaluation
4.3.1 Mean absolute error
olsMae = mae(predict(ols), df$SalePrice)
olsMae[1] 32186.89
Qr50Mae = mae(predict(qr50), df$SalePrice)
Qr50Mae[1] 31160.69
OLS MAE value: 32186.89.
And QR 50th MAE value: 31160.69.
QR for 50th quantile has a lower MAE therefore it is has more accurate predictions.
4.3.2 Root mean squared error
olsRmse = rmse(predict(ols), df$SalePrice)
olsRmse[1] 48209.34
Qr50Rmse = rmse(predict(qr50), df$SalePrice)
Qr50Rmse[1] 49434.81
OLS RMSE value: 48209.34.
And QR 50th RMSE value: 49434.81.
Since OLS algorithm’s goal is to minimize RMSE, as expected it has a better (lower) value. But QR has a very similar value which shows how well QR model can keep up even if it is not focusing on optimizing RMSE.
4.3.3 Variance of error
ols_summary$df[2][1] 1448
qr50_summary$rdf[1] 1448
The variance of error for OLS: 1448.
The variance of error for QR 50th: 1448.
Both have the same variance of error.
4.3.4 Min/max error
# Min OLS error
format(round(min(ols_summary$residuals), digits=0), scientific=F)[1] "-422488"
# Absolute min OLS error
format(round(min(abs(ols_summary$residuals)), digits=0), scientific=F)[1] "5"
# Max OLS error
format(round(max(ols_summary$residuals), digits=0), scientific=F)[1] "326538"
# Absolute max OLS error
format(round(max(abs(ols_summary$residuals)), digits=0), scientific=F)[1] "422488"
# Min QR 50th error
format(round(min(qr50_summary$residuals), digits=0), scientific=F)[1] "-440106"
# Absolute min QR 50th error
format(round(min(abs(qr50_summary$residuals)), digits=0), scientific=F)[1] "0"
# Max QR 50th error
format(round(max(qr50_summary$residuals), digits=0), scientific=F)[1] "351819"
# Absolute max QR 50th error
format(round(max(abs(qr50_summary$residuals)), digits=0), scientific=F)[1] "440106"
4.3.4.1 OLS
Min OLS error: -422488.
Absolute min OLS error: 5.
Max OLS error: 326538.
Absolute max OLS error: 422488.
4.3.4.2 QR
Min QR 50th error: -440106.
Absolute min QR 50th error: 0.
Max QR 50th error: 351819.
Absolute max QR 50th error: 440106.